feat(eval): add ClassifierEvaluator (pure-metadata aggregator) by ajay-kesavan · Pull Request #1674 · UiPath/uipath-python

ajay-kesavan · 2026-05-21T17:00:57Z

Summary

Adds a new evaluator type whose role is to carry a classes list and a source_evaluator name to downstream consumers (the C# Studio Web backend). It does not compute classification metrics per datapoint — that work moves out of the SDK and into the C# layer, which scans each datapoint's agent output for the configured class strings and builds the confusion matrix after the per-datapoint loop finishes.

Replaces the earlier draft architecture in #1669 / #5307 (Python dataset evaluator framework + Temporal worker workflow). The pure-metadata approach is ~50 LOC instead of ~1500 LOC and ships through the existing CLI → backend wire path with zero new endpoints.

How it works

Eval set:
  evaluatorRefs:
    - intent_match            ExactMatch (existing) — produces expected/actual per datapoint
    - intent_classifier       NEW ClassifierEvaluator — carries classes list

CLI / SDK runtime (this PR):
  For each datapoint:
    ExactMatch.evaluate(...)        → result with BaseEvaluatorJustification(expected, actual)
    ClassifierEvaluator.evaluate()  → result with ClassifierJustification(classes, source_evaluator)
                                     (no per-datapoint computation; the result is metadata)
  CLI POSTs both results to C# via the existing per-evaluator-run update path.

C# (in companion Agents PR):
  After the per-datapoint loop, the C# layer detects the classifier evaluator by inspecting
  Justification payloads, reads (output, expected_class) per datapoint, builds the
  confusion matrix + per-class TP/TN/FP/FN, persists into EvaluatorScores envelope.
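The SDK side of the flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses standard-library dataclasses in place of the SDK's models, and `EvaluationResult`, the `evaluate()` signature, and the example class names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifierJustification:
    """Metadata carried to the backend; nothing here is computed per datapoint."""
    classes: list
    source_evaluator: str


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical stand-in for the SDK's per-datapoint result type."""
    score: float
    details: ClassifierJustification


class ClassifierEvaluator:
    """Sketch of a pure-metadata evaluator: it never inspects the agent output."""

    def __init__(self, classes, source_evaluator):
        self.classes = list(classes)
        self.source_evaluator = source_evaluator

    def evaluate(self, agent_output=None):
        # agent_output is deliberately ignored: the score is a constant 0.0
        # and the justification only repeats the configured metadata.
        return EvaluationResult(
            score=0.0,
            details=ClassifierJustification(list(self.classes), self.source_evaluator),
        )


evaluator = ClassifierEvaluator(["book", "cancel", "reschedule"], "intent_match")
result = evaluator.evaluate({"intent": "book"})
```

Because the result does not depend on the datapoint, every datapoint yields the same payload; the downstream layer is what pairs it with the source evaluator's expected/actual values.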

Files

New

eval/evaluators/classifier_evaluator.py — ClassifierEvaluator + ClassifierEvaluatorConfig + ClassifierJustification
tests/evaluators/test_classifier_evaluator.py — 9 unit tests

Modified

eval/models/models.py — EvaluatorType.CLASSIFIER = "uipath-classifier"
eval/evaluators/evaluator.py — discriminator + CodedEvaluator union entry
eval/evaluators/__init__.py — re-export + EVALUATORS list entry

Total: 5 files, +297 / -0.

Test plan

pytest tests/evaluators/test_classifier_evaluator.py — 9 tests passing
pytest tests/evaluators tests/cli/eval — 824 passing (815 existing + 9 new), zero regressions
ruff check / ruff format / mypy — clean on all changed files
Factory smoke: EvaluatorFactory.create_evaluator({"version":"1.0","evaluatorTypeId":"uipath-classifier", ...}) builds it correctly
Per-datapoint smoke: evaluate() returns score=0.0 + ClassifierJustification(classes=..., source_evaluator=...) with the wire JSON shape that the C# layer expects
End-to-end via uipath eval against a real eval set with a classifier — pending companion Agents PR landing
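For readers unfamiliar with the factory smoke check, the dispatch it exercises can be pictured as a lookup on `evaluatorTypeId`. The sketch below is an assumption-laden illustration, not `EvaluatorFactory` itself: the registry, the `create_evaluator` function, and the `evaluatorConfig` key are hypothetical, and the PR elides the rest of the payload.

```python
class ClassifierEvaluator:
    """Hypothetical stand-in for the evaluator added in this PR."""

    def __init__(self, classes, source_evaluator):
        self.classes = list(classes)
        self.source_evaluator = source_evaluator


# Hypothetical registry keyed by the evaluatorTypeId discriminator.
EVALUATOR_REGISTRY = {"uipath-classifier": ClassifierEvaluator}


def create_evaluator(data):
    """Build an evaluator from a parsed JSON definition (illustrative only)."""
    type_id = data.get("evaluatorTypeId")
    evaluator_cls = EVALUATOR_REGISTRY.get(type_id)
    if evaluator_cls is None:
        raise ValueError(f"Unknown evaluatorTypeId: {type_id!r}")
    # "evaluatorConfig" is an assumed key name, not taken from the PR.
    config = data.get("evaluatorConfig", {})
    return evaluator_cls(
        classes=config.get("classes", []),
        source_evaluator=config.get("sourceEvaluator", ""),
    )


built = create_evaluator(
    {
        "version": "1.0",
        "evaluatorTypeId": "uipath-classifier",
        "evaluatorConfig": {
            "classes": ["book", "cancel", "reschedule"],
            "sourceEvaluator": "intent_match",
        },
    }
)
```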

Disposition

This branch supersedes the SDK changes in #1669 (Python dataset evaluator framework). I'll close #1669 once this lands.

🤖 Generated with Claude Code

Adds a new evaluator type whose role is to carry a `classes` list and a `source_evaluator` name to downstream consumers. It does not compute classification metrics per datapoint; that work moves to the Studio Web C# backend, which reads each datapoint's agent output and the source evaluator's expected label after the per-datapoint loop finishes, scans the output for each configured class, and builds the confusion matrix.

The per-datapoint evaluate() returns score=0.0 with a ClassifierJustification(classes, source_evaluator) details payload. This payload survives the existing CLI -> backend wire path via StudioWebProgressReporter._serialize_justification (json.dumps of the model_dump), arriving in the backend as a JSON string inside CodedEvaluatorScore.Justification where the C# layer can read it.

Replaces the design in earlier draft PRs #1669 and #5307: the SDK no longer owns the dataset-level computation. The pure-config approach is ~50 LOC instead of ~1500 LOC of dataset-evaluator framework + worker workflow + factory + child workflow plumbing.

Files:
  src/uipath/eval/evaluators/classifier_evaluator.py    new (~90 LOC)
  src/uipath/eval/evaluators/__init__.py                re-export + EVALUATORS list
  src/uipath/eval/evaluators/evaluator.py               discriminator + Union entry
  src/uipath/eval/models/models.py                      EvaluatorType.CLASSIFIER
  tests/evaluators/test_classifier_evaluator.py         9 unit tests, all passing

Verified:
  pytest tests/evaluators tests/cli/eval --no-cov -> 824 passed
  ruff check / ruff format / mypy -> clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
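The "JSON string inside the score payload" shape described in this commit message can be illustrated with the standard library alone. In this sketch, `serialize_justification` is a hypothetical stand-in for the reporter's private helper, the plain dict stands in for the result of `model_dump()`, and the `justification` key of the outer payload is an assumed name.

```python
import json


def serialize_justification(justification):
    """Encode the justification as a JSON string (stand-in for json.dumps of model_dump)."""
    return json.dumps(justification)


# A plain dict standing in for ClassifierJustification.model_dump().
justification = {
    "classes": ["book", "cancel", "reschedule"],
    "source_evaluator": "intent_match",
}

# The justification travels as a string field, so the outer payload schema
# does not need to know about the classifier-specific shape.
score_payload = {
    "score": 0.0,
    "justification": serialize_justification(justification),
}
wire_body = json.dumps(score_payload)

# Receiving side (C# in the real system; Python here for illustration):
# parse the outer body, then parse the embedded string a second time.
received = json.loads(wire_body)
decoded = json.loads(received["justification"])
```

The double encoding is what lets the metadata ride the existing wire path unchanged: the outer contract only sees a string.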

A minimal 3-class intent classification agent (book / cancel / reschedule) that exercises the new ClassifierEvaluator end-to-end via `uipath eval`. Mirrors the wire shape Studio Web will see once the C# backend and frontend PRs land, so SDK changes can be validated standalone before the full stack is brought up.

Layout:
  main.py                        keyword classifier returning {"intent": "..."}
  evaluations/
    eval-sets/main.json
    evaluators/
      intent_match.json          per-datapoint ExactMatch on .intent
      intent_classifier.json     new uipath-classifier with classes + sourceEvaluator
  README.md                      Path A (SDK CLI) + Path B (Studio Web) instructions

Each datapoint has `evaluationCriterias.intent_classifier: {}` (the runtime skips evaluators that aren't keyed there). 6/9 datapoints are correctly classified by design; the resulting (expected, actual) pairs flow through the existing CLI -> backend wire path inside the classifier's justification payload as classes/source_evaluator metadata.

Verified live:
- ExactMatch averages to about 0.67 (6/9 correct).
- ClassifierEvaluator emits {"expected":"","actual":"","classes":[...], "source_evaluator":"intent_match"} per datapoint.
- Plugging the (expected, actual) pairs from the resulting output into the same confusion-matrix math the C# helper implements yields macro F1 of 0.667 on this fixture, the number Studio Web's Aggregations panel would render once the backend pipeline is live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
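The confusion-matrix arithmetic referred to here can be sketched in a few lines. The pairs below are hypothetical data chosen so that 6 of 9 are correct; they are not the PR's fixture, and the function names are illustrative rather than those of the C# helper. The sketch also assumes every label belongs to the configured class list.

```python
from collections import Counter

CLASSES = ["book", "cancel", "reschedule"]

# Hypothetical (expected, actual) pairs, 6 of 9 correct. Not the PR's fixture.
PAIRS = [
    ("book", "book"), ("book", "book"), ("book", "cancel"),
    ("cancel", "cancel"), ("cancel", "cancel"), ("cancel", "reschedule"),
    ("reschedule", "reschedule"), ("reschedule", "reschedule"), ("reschedule", "book"),
]


def per_class_counts(pairs, classes):
    """Return TP/FP/FN/TN per class from (expected, actual) pairs."""
    matrix = Counter(pairs)  # keys are (expected, actual)
    total = sum(matrix.values())
    counts = {}
    for cls in classes:
        tp = matrix[(cls, cls)]
        fn = sum(matrix[(cls, other)] for other in classes if other != cls)
        fp = sum(matrix[(other, cls)] for other in classes if other != cls)
        counts[cls] = {"tp": tp, "fp": fp, "fn": fn, "tn": total - tp - fp - fn}
    return counts


def f1(c):
    """F1 = 2TP / (2TP + FP + FN), defined as 0.0 when the denominator is zero."""
    denom = 2 * c["tp"] + c["fp"] + c["fn"]
    return 2 * c["tp"] / denom if denom else 0.0


def macro_f1(pairs, classes):
    """Unweighted mean of the per-class F1 scores."""
    counts = per_class_counts(pairs, classes)
    return sum(f1(counts[cls]) for cls in classes) / len(classes)


counts = per_class_counts(PAIRS, CLASSES)
score = macro_f1(PAIRS, CLASSES)
accuracy = sum(1 for expected, actual in PAIRS if expected == actual) / len(PAIRS)
```

With this particular data each class has TP=2, FP=1, FN=1, so every per-class F1 is 2/3 and the macro average is 2/3 as well.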

Pydantic's generic resolution leaves T = typing.Any when a TypeVar is parameterized with its own bound (BaseEvaluationCriteria here), so BaseEvaluator[BaseEvaluationCriteria, ...] tripped the runtime's "X must be a subclass of BaseEvaluationCriteria" guard at load time:

  Failed to create evaluator from file 'evaluations/evaluators/classifier-*.json':
  typing.Any must be a subclass of BaseEvaluationCriteria.

Introduce an empty ClassifierEvaluationCriteria(BaseEvaluationCriteria) subclass and parameterize Config + Evaluator with it. Mirrors how every other built-in evaluator (ExactMatch via OutputEvaluationCriteria, etc.) provides a concrete criteria type even when no per-datapoint fields are needed.
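The shape of the fix can be illustrated without Pydantic. The sketch below does not reproduce Pydantic's generic resolution; it only models a load-time guard of the kind quoted above and shows that an empty concrete subclass satisfies it while `typing.Any` does not. `require_criteria_type` and the single-parameter `BaseEvaluator` are hypothetical simplifications.

```python
from typing import Any, Generic, TypeVar, get_args


class BaseEvaluationCriteria:
    """Stand-in for the SDK base class."""


class ClassifierEvaluationCriteria(BaseEvaluationCriteria):
    """Empty on purpose: exists only to give the generic a concrete type."""


T = TypeVar("T", bound=BaseEvaluationCriteria)


class BaseEvaluator(Generic[T]):
    """Simplified to one type parameter for illustration."""


def require_criteria_type(tp):
    """Hypothetical guard: reject anything that is not a criteria subclass."""
    if not (isinstance(tp, type) and issubclass(tp, BaseEvaluationCriteria)):
        raise TypeError(f"{tp} must be a subclass of BaseEvaluationCriteria")
    return tp


# A concrete subclass passes the guard.
good = get_args(BaseEvaluator[ClassifierEvaluationCriteria])[0]
accepted = require_criteria_type(good)

# typing.Any, the value the commit message says the resolution produced, does not.
bad = get_args(BaseEvaluator[Any])[0]
try:
    require_criteria_type(bad)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```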

Replaces the standalone ClassifierEvaluator with an `aggregators` config field on per-datapoint evaluators (ExactMatch first). Run-level classification metrics are now driven by the host evaluator's config, not by a separate evaluator with a source-evaluator ID reference.

Design rationale (see Confluence "Design for Precision and Recall" §5.2): the standalone evaluator forced users to add TWO evaluators and copy an opaque ID between them. Moving aggregator config onto the evaluator that already emits the labels keeps the source of truth in one place and makes the JSON file portable across conversions (e.g. low-code -> coded).

- New module `_aggregators.py` with AggregatorSpec / ClassificationAggregatorSpec
- ExactMatchEvaluatorConfig gains optional `aggregators: list[AggregatorSpec] | None`. The Python runtime ignores the field; it's metadata for the downstream C# aggregation pass.
- `_progress_reporter.py:_build_evaluator_snapshot` now also emits `aggregators` so the field flows into EvaluatorRun.EvaluatorSnapshot and the C# layer can discover it without consulting the eval set definition file separately. Bug fix: previously the builder only emitted prompt+model (LLM-judge only), so for ExactMatch the dict was empty and the snapshot ended up null in the wire payload.
- ClassifierEvaluator, ClassifierEvaluationCriteria, ClassifierJustification, ClassifierEvaluatorConfig: all deleted.
- EvaluatorType.CLASSIFIER enum value removed.
- Discriminator union in evaluator.py drops the Classifier branch.

Version bump 2.10.70 -> 2.10.72 (the previous .71 was an unused dev cache-bust). The new ExactMatch.aggregators field is a public API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
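A rough picture of the reworked design, under stated assumptions: the spec fields (`classes`, `type`), the `name` field on the config, and the standalone `build_evaluator_snapshot` function are hypothetical, since the commit message does not show them. Dataclasses stand in for the SDK's models.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ClassificationAggregatorSpec:
    """Hypothetical field set; the real spec's fields are not shown in the PR."""
    classes: list
    type: str = "classification"


@dataclass
class ExactMatchEvaluatorConfig:
    """Only the fields needed for this illustration."""
    name: str
    aggregators: Optional[list] = None  # ignored by the runtime, read downstream


def build_evaluator_snapshot(config, prompt=None, model=None):
    """Sketch of a snapshot builder that emits aggregators alongside prompt/model."""
    snapshot = {}
    if prompt is not None:
        snapshot["prompt"] = prompt
    if model is not None:
        snapshot["model"] = model
    # Without this branch a non-LLM evaluator would produce an empty dict,
    # which is the situation the commit message describes as the bug.
    if config.aggregators:
        snapshot["aggregators"] = [asdict(spec) for spec in config.aggregators]
    return snapshot


with_aggregators = ExactMatchEvaluatorConfig(
    name="intent_match",
    aggregators=[ClassificationAggregatorSpec(classes=["book", "cancel", "reschedule"])],
)
plain = ExactMatchEvaluatorConfig(name="intent_match")

snapshot = build_evaluator_snapshot(with_aggregators)
empty_snapshot = build_evaluator_snapshot(plain)
```

Keeping the aggregator spec on the evaluator that already emits the labels means a single JSON file carries both the per-datapoint check and the run-level metric request.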

sonarqubecloud · 2026-05-24T02:21:40Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 90%)

See analysis details on SonarQube Cloud

github-actions bot added the test:uipath-langchain (Triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 21, 2026

ajay-kesavan marked this pull request as ready for review May 21, 2026 17:34

This was referenced May 22, 2026

feat(eval): add dataset-level evaluator framework with precision/recall/f-score #1669

Closed

feat(eval): classification evaluator schemas + sample projects + e2e tests #1663

Closed

ajay-kesavan wants to merge 4 commits into main from feat/eval-classifier-evaluator
